Data pipelines & Grammar of Graphics

Summary

Learning outcome L03 – interpret and compare graphical and numerical summaries using ggplot2

In this lab we will:

  • Learn to explore data to observe patterns for initial research questions;
  • combine what we have learned in Weeks 2 and 3 to create data efficient and tidy data pipelines, based on the tidyverse;
  • produce presentable boxplots, bar charts and scatter plots.

Demo

Let’s consider the iris dataset.

glimpse(iris) # like str()
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…

What if this were the first time we saw this data? How would we come up with research questions? For small data (approx. < 8 columns and < 500 rows) we can visualize them directly.

Bivariate relationships

  • Can be done outside of formal report
plot(iris)

From plot() we can observe that:

  • There appear to be some linear relationships between variables
  • Separations
  • Groups
  • There are patterns of “clustering”, which can be explored (because of observation below)
  • One variable is clearly a factor or ordinal. It can be used to investigate differences due to the factor.

Univariate patterns

  • Focusing on single category data performance

Here we use ggplot2::facet_wrap() to view a histogram of all numeric variables.

iris %>%
  as_tibble() %>% # Looks better if executed only
  keep(is.numeric) %>%           # keep variables that pass `is.numeric()`
  # Factors are now gone. (Species)
  
  pivot_longer(everything()) %>% # gather all variables
  #This gives format like category(name): value(value) for all the entries in data
  
  ggplot(aes(value)) +           # plot
  geom_histogram(binwidth = 0.1) +
  facet_wrap(~ name, scales = "free") +  # view by facet
  geom_blank() # why is this useful?

From the plots we can observe that:

  • All the variables are continuous
  • Multi-modal distributions of some variables observed.

Research questions

Research question: Is there a relationship between petal length and petal width? - Yes, linear model.

iris %>%
  select(Petal.Length, Petal.Width) %>% # This can be ignored, but why keep?
  # If the data was pre-processed, like drop_na(), not selecting can make all categories drop NAs. However, this can be helpful for our analysis of targeted categories.
  drop_na() %>%
  
  ggplot(aes(Petal.Length, Petal.Width)) +
  geom_point() +
  geom_smooth(method = "lm") + # generate linear regression line
  xlab ("Petal length (cm)") +
  ylab("Petal width (cm)") +
  ggtitle("Relationship between petal length and petal width") +
  theme_classic() + # Just a theme, theme_
  geom_blank()
## `geom_smooth()` using formula 'y ~ x'

Research question: Is petal length different between species? - Yes, lots. - And boxplot and barplot are all good to go.

iris %>%
  select(Petal.Length, Species) %>%
  ggplot(aes(Species, Petal.Length)) + # add `fill =` to add color to boxes
  geom_boxplot() +
  xlab("") +
  ylab("Petal length (cm)") +
  ggtitle("Differences in petal length between species") +
  theme_minimal() +
  geom_blank()

Use summary statistics to describe the plot better!

iris %>%
  select(Petal.Length, Species) %>%
  group_by(Species) %>%
  summarise(mean = mean(Petal.Length),
            median = median(Petal.Length))
## # A tibble: 3 × 3
##   Species     mean median
##   <fct>      <dbl>  <dbl>
## 1 setosa      1.46   1.5 
## 2 versicolor  4.26   4.35
## 3 virginica   5.55   5.55

Observation: The boxplot shows that each species has a distinct range of petal length values. For example, setosa species has the smallest mean petal length (1.46 cm), which is at least 4 times smaller than virginica. The range of petal lengths for versicolor and virginica overlap.

Bonus question: Is there a better way to visualise the data?

Large data

Consider the diamonds data.

glimpse(diamonds)
## Rows: 53,940
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…

With more than 53,000 rows, using plot() is ill-advised.

  • Don’t do it, crush your computer and life.

However since this is a large data set we may still be able to observe patterns by randomly sampling the data to achieve at least 500 samples.

and reducing columns can help: focus on the categories that you care.

diamonds %>%
  sample_frac(0.01) %>% # sample 1% of the data to plot
  plot()

Here we can see how plot() becomes less useful when many variables are involved. This is where domain knowledge comes into play. We pick variables of interest based on what we understand about diamond properties, and the quick exploratory plot that we have visualized.

From the plots, we can observe that:

  • x, y, z are representing the same thing (size) because how perfect there linear relationships of each other.
  • there is a possible linear relationship between carat and price.
  • many variables are categorical, therefore we can explore their influence on price!
diamonds %>%
  keep(is.numeric) %>%
  pivot_longer(everything()) %>% # gather all variables
  ggplot(aes(value)) +           # plot
  geom_histogram() +
  facet_wrap(~ name, scales = "free") +  # view by facet
  geom_blank()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Based on the histograms we can observe that:

  • price is very right-skewed, which is worth investigating
  • carat shows that high carat diamonds are rare in this dataset

Research question: What is the relationship between price and carat? Is the relationship related to clarity?

diamonds %>%
  select(price, carat, clarity) %>%
  ggplot(aes(price, carat, colour = clarity)) +
  geom_point() +
  facet_wrap(~ clarity, scales = "free_x") +
  xlab("Price (USD$)") +
  ylab("Carat") +
  ggtitle("Relationship between price and carat in `diamond` data") +
  theme_bw() +
  geom_blank()

What are your conclusions?

They are linear indeed.

Except I1, there are no significant difference in between the rest, so no, we just had one really special case.

Research question: What is the total worth of all the diamonds in this dataset, based on colour?

diamonds %>%
  group_by(color) %>%
  summarise(total = sum(price)) %>%
  ggplot(aes(color, total)) +
  geom_bar(aes(fill = color), stat = "identity") +
  scale_colour_brewer() + # viridis
  labs(fill = "Colour scale") +
  xlab("Diamond colour scale") +
  ylab("Total price (USD$)") +
  ggtitle("Total worth of all diamonds, grouped by colour") +
  theme_bw() + 
  geom_blank()

Explore: Texas housing data

We will use the txhousing data, available in ggplot2.

glimpse(txhousing)
## Rows: 8,602
## Columns: 9
## $ city      <chr> "Abilene", "Abilene", "Abilene", "Abilene", "Abilene", "Abil…
## $ year      <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, …
## $ month     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, …
## $ sales     <dbl> 72, 98, 130, 98, 141, 156, 152, 131, 104, 101, 100, 92, 75, …
## $ volume    <dbl> 5380000, 6505000, 9285000, 9730000, 10590000, 13910000, 1263…
## $ median    <dbl> 71400, 58700, 58100, 68600, 67300, 66900, 73500, 75000, 6450…
## $ listings  <dbl> 701, 746, 784, 785, 794, 780, 742, 765, 771, 764, 721, 658, …
## $ inventory <dbl> 6.3, 6.6, 6.8, 6.9, 6.8, 6.6, 6.2, 6.4, 6.5, 6.6, 6.2, 5.7, …
## $ date      <dbl> 2000.000, 2000.083, 2000.167, 2000.250, 2000.333, 2000.417, …

Domain knowledge

What is the background to the txhousing data? Hint: use ?txhousing, and consider previewing the data by plotting. Don’t forget to downsample data that contains more than 500 samples/rows.

Explore

  • Investigate the structure of the data
  • Find interesting observations with visualisations and numerical summaries.
  • Come up with several research questions to answer using ggplot2. Don’t forget to add labels and refine legends.
  • Use numerical summaries to support your findings.

Project 1

  • How is progress?
  • Finalise your research questions and refine your plots based on what you have learned in this Lab
  • Record your presentations early so that someone can edit the videos (if necessary).
  • Any final questions?